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Abstract 

Eigenvalue analyses of complex structures is a computationally intensive 
task which can benefit significantly from new and impending parallel compu- 
ters. This study reports on a parallel computer implementation of the Lanczos 
method for free vibration analysis. The approach used here subdivides the 
major Lanczos calculation tasks into subtasks and introduces parallelism down 
to the subtask levels such as matrix decomposition and forward/backward 
substitution. The method was implemented on a commercial parallel computer 
and results were obtained for a long flexible space structure. While parallel 
computing efficiency is problem and computer dependent, the efficiency for the 
Lanczos method was good for a moderate number of processors for the test 
problem. The greatest reduction in time was realized for the decomposition 
of the stiffness matrix, a calculation which took 70 percent of the time in 
the sequential program and which took 25 percent of the time on eight proces- 
sors. For a sample calculation of the twenty lowest frequencies of a 486 
degree of freedom problem, the total sequential computing time was reduced by 
almost a factor of ten using 16 processors. 
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Nomenclature 

decomposition of mass matrix 
diagonal matrix 
stiffness matrix 
lower triangular matrix 
mass matrix 

tridiagonal matrix of mth order 

Lanczos vector 

vector of degrees of freedom 
elements of tridiagonal matrix 

shift parameter 
eigenvalue 

frequency parameter 


Introduction 

The eigenvalue problem associated with free vibration analysis of complex 
structures is one of the more computationally intensive tasks in the design of 
modern aerospace vehicles. With more sophisticated vehicle designs and atten- 
dant detailed analyses, structural models composed of thousands of degrees of 
freedom are not uncommon and the associated free vibration analyses could 
require hours of computing time. 
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Recent and impending advances in computing capabilities indicate that 
effective computing speeds will soon approach 10 GIGA FLOPS (10 10 floating 
point operations per second) (ref. 1). Such computing speeds are being 
achieved by the development of innovative computer architectures consisting of 
arrays of processors operating in parallel. These parallel computing archi- 
tectures are also being scaled down to computers whose price range is less • 

than $1M and whose effective computer speeds approach 1 GIGA FLOPS. These 
trends indicate that such parallel computers can significantly improve the 
capability for large scale free vibration analyses. However, the hardware 
development of this new class of computers has surpassed the necessary soft- 
ware development needed to utilize fully the computational power available. 

The key to achieving peak performance is the modification of existing 
algorithms or the development of new methods tailored to parallel computers. 

There are many sophisticated, efficient sequential eigenvalue solvers now 
available (refs. 2-9). Parlett (ref. 9) discusses these methods and their 
effectiveness. In order to adapt these or other algorithms to a parallel 
computing environment these methods must be examined to identify the possi- 
bility for parallelism in the computation steps and how the methods' perform- 
ances can be improved by parallel computations. This paper focusses on one 
important method, the Lanczos method, (refs. 10-17) to develop a comprehensive 
parallel strategy for vibration analysis. The study builds on earlier work 
(ref. 17) with a parallel Lanczos method, introduces parallelism at lower 
levels of the program by subdividing tasks into subtasks and develops software 
capability to facilitate control of parallel eigenvalue calculations. 


Parallel Implications for Eigenvalue Methods 

Several of the available eigenvalue methods are roughly categorized as 
determinant methods, rotation methods and iterative methods (refs. 2, 4, 18- 
20) . The type of method used for a specific application depends in large part 
on the type of problem being solved, the size of the problem and the desired 
results. Sparse matrix operations, of the type encountered in structural 
problems, can provide special features which some algorithms do not address. 
The number of eigenvalues desired and their location in the overall spectrum 
may also require a specific strategy. Reference 9 indicates that, due to its 
efficiency, the Lanczos method is a key method for structural applications. 

In view of its growing acceptance, the Lanczos method was selected here for 
study for parallel implementation. 

The most comprehensive computer programs include several eigenvalue 
methods. An investigation of the various methods shows that many have certain 
calculation steps in common. These major calculation steps become areas of 
attention for reducing calculation time on a parallel computer. Since some 
methods, such as the Lanczos method, are mixed approaches, parallel algorithms 
which speed up calculations at the sub task level can be easily integrated into 
more than one method. 


Levels of Parallelism 

In dissecting a solution process for parallel implementation, one looks 
for those tasks that can be carried out concurrently with a minimum exchange 
of information or data dependencies. There are many levels or granularities 
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of parallelism. An example of a high-level, or coarse-grained, degree of 
parallelism is the concurrent execution of large sections of a program on a 
number of processors. Further decomposition can lead to subsequent levels of 
parallelism, from the highest level to the very- fine -grained parallel execu- 
tion of a few arithmetic operations simultaneously on several processors. 
Parallelism always introduces overhead and by examining the amount of overhead 
introduced, one can assess the benefits to be reaped and then decide when to 
further divide a task into subtasks. The development of parallel algorithms 
is complicated by the interactions of many processors operating concurrently. 
The algorithm must consider such things as: communication overhead, memory 

contention, critical regions which contain code that must be executed sequen- 
tially, the time to initiate tasks, data interdependency, and idle time 
resulting from an imbalance of the workload. Potential speedups may be 
relatively small at the lowest level of granularity, i.e., individual 
arithmetic expressions. This study investigates the level of granularity that 
will produce the most efficient performance. 


The Lanczos Method 

For the implementation in this study, the Lanczos method was used as 
representative of a multi-step process which lends itself to concurrent 
calculations. The basic Lanczos method is briefly summarized in Appendix A. 
The method seems well suited for the type of problems often encountered in 
vibration studies of large space structures. 

It is most efficient when solving for a few extreme eigenvalues of a very 
large system. This same property makes the method applicable to parallel 
processing since it is efficient to use several processors to compute a few 
values rather than to solve for many values in a single sequential solution. 

An implementation of the Lanczos method where parallelism is initiated at a 
high level of granularity by introducing shifts and having separate processors 
solve for eigenvalues in different areas of the spectrum is documented in 
reference 17. An example of the speedups obtained for a truss vibration 
example problem using this strategy is shown in figure 1. Speedup is defined 
as the time it takes to do a calculation on one processor divided by the time 
it takes on multiple processors. The theoretical speedup is the optimum 
achievable when all processors are operating at 100 per cent efficiency. 
Significant speedups were obtained for up to eight processors for the truss 
problem in figure 1. 

The next step in a parallel Lanczos implementation is to decompose the 
major tasks into subtasks and map them onto several processors. The sequen- 
tial order of steps in the basic Lanczos procedure is shown in figure 2. The 
steps include initialization, which includes the introduction of the shift, 
decomposition of the stiffness matrix, forward solution, back substitution, 
calculation of the Lanczos vectors and, finally, the solution of the resulting 
tridiagonal system of equations by the bisection method. By computing a 
finite number of vectors, a tridiagonal matrix is constructed that approxi- 
mates the eigenvalues of the original large problem. The parallel strategy is 
to assign sub tasks of these major tasks to available processors as needed. 

An implementation of the parallel strategy was carried out on a Flexible 
Computer Multi -computer FLEX/32. 
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The FLEX/32 is a multiple -instruction multiple -data (MIMD) computer with 
both shared and local memory (ref. 21). The FLEX/32 configuration used in 
this study consists of 20 processors. Two processors are dedicated to program 
development and 18 are dedicated to parallel processing. 

Although, in theory, all computed eigenvalues would be good approximations 
to the actual answers, in practice this does not hold true. Due to the loss of 
orthogonality in the computed vectors, only some results can be accepted. 
Reference 11 clarifies orthogonalization issues; methods for best determining 
acceptable eigenvalues is a subject of ongoing research (refs. 4, 9, 11, 14). 

An extensive discussion of the method and reorthogonalization procedures can be 
found in reference 14. 

Some approaches include total or selective reorthogonalization with 
respect to previously calculated vectors. Another less computationally 
expensive approach recommended by Cullum and Willoughby (refs. 14, 17) provides 
an easy test for selecting valid eigenvalues from those obtained through 
Lanczos calculations. In this study, the Cullum and Willoughby test is used 
because in comparison with reorthogonalization, it was found to be reliable and 
efficient. 


Space Mast Problem 

The test problem used in this study is the 60-meter three -longeron truss 
structure shown in figure 3 attached as a mast to the space shuttle orbiter 
with an antenna attached. For this study, the mast is considered rigidly 
connected to the orbiter and the antenna is not considered. The mast is 
composed of extensional members and masses are concentrated at the nodes. The 
details of the mast and node members are shown in the figure where the mast is 
organized according to major substructures. For this problem, there are three 
degrees of freedom at each unconstrained node resulting in a model of 486 
degrees of freedom. 

Representative vibration results for the space mast are shown in figure 4 
and the sequential times for each Lanczos step on a single processor are shown 
in figure 5. For the test problem, the decomposition of the stiffness matrix K 
into the product of a diagonal matrix D and a lower triangular matrix L took 70 
percent of the solution time. As the problem size grows, the time taken to 
decompose the stiffness matrix grows accordingly. For this algorithm, the 
decomposition is done once for each shift value. In a nonlinear analysis, the 
decomposition would be done at every time step, making any reduction in calcu- 
lation time even more meaningful. Since this step seemed to offer the most 
benefits, it was incorporated into the parallel code first. 


Parallel Implementation 

The first strategy in parallelizing the decomposition step is shown in 
figure 6 which depicts a four-processor implementation. This strategy repre- 
sents a pre-scheduled assignment of tasks done in a row interleave fashion. 

Each processor p is assigned the calculation necessary for rows p, p+n, p+2n, 
..., where n is the number of processors. The computed values are stored in 
shared memory where they are accessible to all processors. The processors must 
be synchronized to ensure that values needed for the next step are already 
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computed by the assigned processors and stored. In this implementation, the 
synchronization accounts for most of the overhead associated with the 
parallelism. 

For comparison, a second strategy was adopted to parallelize the 
decomposition step. In this case, a queue of tasks is set up in shared memory 
and each processor takes its next work assignment from this queue. The timing 
for this self- scheduled strategy was almost identical to the pre- scheduled task 
assignment. Timings given in the following figures were obtained using this 
self- scheduled task assignment. One advantage of the self- scheduling is that 
if one processor should fall behind in its calculation or experience hardware 
problems, the other processors would continue the calculations. 

The timing results for decomposing the n x n matrix, where n is equal 
to 486, into a lower triangle matrix and a diagonal array by using up to 16 
processors are shown in figure 7. The speedups are shown in figure 8. Signifi- 
cant speedups are obtained for up to eight processors for the decomposition of 
the stiffness matrix. Using more than eight processors for this problem did 
not result in significant time reduction. The size of the matrix and particu- 
larly, the size of the bandwidth determines the amount of calculation in the 
decomposition step. As the order of the matrix approaches infinity, the number 
of arithmetic operations in the decomposition step approaches 1/2 (n) (s) 

(s+1) , where s is the semi -bandwidth and n is the order of the matrix. 

Since the matrix is banded, the calculation of the zero elements is ignored. 
This means that the work becomes equal on each processsor when the assigned row 
number is equal to or greater than the semi -bandwidth, which in this case is 
18. When 16 processors are working concurrently on the decomposition of the 
same matrix, each processor must wait for 15 others to compute a needed value 
at each step. 

The forward solution and back substitution steps (fig. 9) were also 
parallelized in a row interleave fashion. One of the decisions that has to be 
made when using the Lanczos algorithm is to determine the order of the 
resulting tridiagonal matrix. This order represents the number of times the 
forward solution, back substitution and vector calculation steps are carried 
out. If the order is m, then m eigenvalues will be found. Not all of these 
eigenvalues are valid approximations to the eigenvalues of the original 
problem, since redundant and/or spurious values may appear (refs. 14, 17). One 
rule of thumb is to make m twice as large as the number of eigenvalues 
desired. 

To study the effect of the choice of m, various values were used to 
calculate the eigenvalues of the space mast problem. A plot of the time for 
computing one acceptable eigenvalue is shown in figure 10 for three values of 
m. It was found that an m equal to 30 gave the most information for the 
least time. The plot shows about the same time for m equal to 16 as for m 
equal to 30. However, in addition to the acceptable values, approximate values 
are obtained for additional eigenvalues for the larger value of m. The first 
25 eigenvalues obtained for m equal to 30 are given in table 1. Tests were 
then made to eliminate the multiple and spurious eigenvalues. The acceptable 
values are marked with an asterisk. These values compare favorably to values 
found using an independent structural analysis code (ref. 22). 

Timing results for the forward solution with m equal to 30 are shown in 
figure 11. This step consists of much less computation time compared to 
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synchronization time than in the decomposition step. The speedups shown in 
figure 12 show a gain on up to four processors. For this size problem, there 
is actually a decrease in the speedup when using eight processors for the 
forward solution step. 

The amount of computation in the back substitution step is less than in 
the forward solution step. Timimg results shown in figure 13 indicate that 
more than four processors dedicated to this step for this problem would be 
ineffective. 

A plot of the execution time versus the number of processors for the 
overall Lanczos method is shown in figure 14. A plot of the speedups for the 
overall method is shown in figure 15. The execution time for the various steps 
on several processors is shown in table 2. The decrease in execution time is 
due only to the parallelization of the decomposition, forward solution, and 
back substitution steps. Steps which have not been done in parallel include: 
initialization, calculation of the Lanczos vectors and the bisection. For this 
problem these steps took a relatively small amount of time and implementation 
of parallel processing is only of minor benefit. 

To provide a measure of the total efficiency of a parallel Lanczos method, 
timing results are shown in table 3 for the test problem where four shifts are 
introduced and shift calculations are assigned to four separate sets of proces- 
sors. An average of five valid eigenvalues was obtained for each shift region. 
The table shows that it takes 876 seconds to run the sequential program using 
four different shifts. When the four shifts are each assigned to a set of four 
processors for a total of 16 processors, the time is reduced to 89 seconds. 
Timing reductions resulting from a differing number of processors assigned to 
each shift are shown in figure 16. 


Concluding Remarks 

This study has focussed on a parallel computer implementation of the 
Lanczos method for free vibration analysis. The approach used extends previous 
work by subdividing the major Lanczos calculation tasks into sub tasks and 
introducing parallelism at the subtask level. 

Results were obtained for a long flexible space structure test problem and 
the method was implemented on a commercial parallel computer. The results of 
the study indicate the Lanczos method is a promising method for providing 
calculation speedups when tasks are subdivided and assigned to several proces- 
sors. Since the method is most efficient when solving for a few eigenvalues at 
the extreme ends of the spectrum, several processors can be working concur- 
rently in different areas of the eigenvalue spectrum. Subdividing the Lanczos 
tasks among processors provides the best use of the parallel resources. The 
decomposition step is the most time-consuming calculation step for large 
problems when only a few Lanczos iterations are performed. The results show 
that for this test problem, using eight processors for the decomposition of the 
stiffness matrix was the optimum. In subtasks where the calculation time is 
relatively short with respect to the synchronization time, such as in the 
forward and backward solution steps, no more than eight processors were 
effective. The efficiency is problem dependent and the number of processors 
must be tailored to the specific application. 
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By assigning a set of processors to an interval in the eigenvalue 
spectrum, eigenvalues can be obtained faster than on a sequential computer of 
the same speed. A combination of coarse-grained parallelism at the program 
level and fine-grained parallelism at the task level results in significant 
reductions in solution time. Parallel processors offer the structural analyst 
an effective tool for solving large structural problems in a timely fashion. 


Appendix A 


Lanczos Eigenvalue Method 

The following briefly outlines the Lanczos algorithm strategy as 


implemented in this study. Consider the 
KX - w 2 MX 

where X is the displacement vector, K 

2 

mass matrices, respectively, and w is 
eigenvalue shift parameter 6 such that 

2 r A - 2 

w — 8 + w 

and decompose M into 
T 

M - BB 

where B is a lower triangular matrix. 

Equation (Al) is then transformed to 
AY - AY 


eigenvalue problem 

(Al) 

and M are symmetric stiffness and 
the frequency parameter. Introduce an 

(A2) 

(A3) 

the Lanczos format (ref. 7, 9-19) 

(A4) 


where 


A - B T [K' 1 ] B Y - B T X (A5) 

and 

A - K - K - S M (A6) 

0 ) 

In equation (A4) , A is symmetric since K is symmetric. The solution 
to equation (A4) by the Lanczos algorithm yields increasingly good 
approximations to the largest eigenvalues A which correspond to the smallest 
-2 2 

u> and the closest values to a reference value 8 . If a sufficient number 

of Lanczos cycles are carried out, the method will theoretically yield all 
eigenvalues . 

The Lanczos transformation is 


K - 
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where L is a unit triangular matrix and D is a diagonal matrix. Equation 
(A12) results in 

AV t - B T L'^LD)'- 1 ^ (A14) 

or 

(B 1 !/ 1 )' 1 AV £ - (LD)' 1 BV i - a ± (A15) 


where a^ is an intermediate vector. Equation (A15) gives 



LD a i - BV i 

(A16) 

and 

T -T --1 

E L a ± - K V ± 

(A17) 

Letting 

l- t n - b t 

(A18) 

results in 

L \ - . t 

(A19) 

and 

AV £ - B 1 ^ 

(A20) 


Equations (A16) , (A19) , and (A20) become the equations for determining 
AV^ . The sequence for determining the uses an arbitrary starting vector 

(e.g., V q - (1, 0, 0, 0, . . . )) and then calculates AV^ from equations 

(A16), (A19) , and (A20) . The remaining steps to obtain a^, 0^ and are 

based on the Gram- Schmidt orthogonalization as follows 


W i “ AV i 
i - 1. 2, . 


^i-1 V i-1 where P 0 “ 0 


T 

V W 
i i 


W, 


°i V i 


(A21) 

(A22) 

(A23) 


NOTE: 


For reorthogonalization insert: c^ 


C i - ^ V J c i V j 


(A23a) 


5 i “ t C i C i> 


(A24) 


v i + i - k °i 


(A25) 
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The cycles are repeated to produce a set of m vectors V. , . . V (m < 

I m 

n) and associated and 0^. The eigenvalues may be obtained at any stage 

for the m by m equation 

TQ - AQ (A26) 

where T is composed of the and fi ^ calculated to that point. 


The eigenvalues of equation (A26) can be obtained by the Sturm sequence 
and bisection methods. Let the principal minors of equation (A26) be denoted 


hd 

o 

1 

P 1 (A) - - A 

(A27) 

PjO) - 

(a i' A)P i-l (A) ' P i-2 (A) 


(i ~ 2, 

. . . , m < n) 

(A28) 


Direct calculation of these sequences can result in overflow or underflow; the 
following sequence avoids the need to rescale. 


P.(A) 

*i(*> " p~ 


i-1 


(A) 


to give 


q o “ 1 ’ q i " “i ' A 


p i_l 

q i< A) - 


(A29) 


(A30) 


The Sturm sequence property for equations (A27) and (A28) is that for a 
specific value of A*, the disagreements in sign between consecutive numbers 
of the sequence P^(A*) is equal to the number of eigenvalues smaller than 

A*. 


In the bisection method, the Sturm sequence property is used to restrict 
the interval in which a particular eigenvalue must lie, until the eigenvalue is 
predicted to sufficient accuracy. An interval in question is halved and the 
Sturm sequence calculated to determine the number of eigenvalues in the two 
intervals. The interval containing the largest eigenvalue is subsequently 
halved and the process repeated until the largest eigenvalue has been isolated 
to sufficient accuracy. This approach is then applied to determine the next 
lower eigenvalues in order of the relative size. When in the neighborhood of 
an eigenvalue, it is often more efficient to switch from the bisection method 
to an interpolation scheme; the bisection method is also well suited for 
implementation on a parallel computer as any number of eigenvalues can be 
calculated in parallel. 
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For a given eigenvalue A R the associated eigenvector of equation (A26) 
Q r is found from equation (A26). The corresponding eigenvector of (A4) Y R 
is then given by 

\ - V Q r (A31) 

and the for equation (Al) can be found from 

fiT *K “ \ 

The eigenvalue is transformed to the desired frequency parameter by 
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Table 1 Multiprocessor Eigenvalue Approximations 


Eigenvalue shift - 0.0 

Eigenvalues calculated before 
elimination of multiplicities 


Order of trl diagonal matrix - 30 

NICE/SPAR results 
(ref. 22) error 


10.04 * 

10.04 

10.04 

10.04 

10.04 

10.04 

10.04 

4x10-10 

11.74 * 

11.74 

11.74 

11.74 

11.74 

11.74 

11.76 

6x10-9 

380.2 * 

380.2 

380.2 

380.1 

1x10-3 

442.0 * 
442.0 

443.8 

5x10-3 

766.3 

766.3 

766.3 

1001. 

1x10-1 

2876. * 

2835. 

2x10+1 

3649. 

3275. 

2x10+1 

7890. 



7890. 




18835. 


* accepted as valid approximation 


Table 2 Time to solve for twenty valid eigenvalues 


Number 

of 

shifts 

Number of processors 
assigned to each 
shift region 

Total number of 
processors 

Time 

(seconds) 

Speedup 

4 

1 

1 

876 

1 

4 

1 

4 

227 

3.9 

4 

2 

8 

130 

6.7 


4 


4 


16 


87 


9.8 













Processor 



. 4 Representative nods shapes of 60 -meter nest. 


1 

2 

3 

4 
1 
2 
3 


L 21 °2 

L 31 L 32 °3 

L 41 L 42 L 43 °4 

Si S2 S3 L 54 °5 

L 61 S2 S3 L 64 S5 °6 

L 71 L 72 L 73 L 74 L 75 L 76 °7 

Si L 82 S3 L 84 L 85 L 86 S7 °8 

Si S 2 S3 L 94 L 95 L % S7 S8 °9 


l i: = 


D. = 


v&vvv 

IT 


1 




Lancos method operation * Time in seconds 

1. initialization 

*=K-6M 5 

2. Decomposition 

K = LDL T 156 

3. Forward solution 

l Da = V 32 

4. Back substitution 

l T y = a I? 

5. Calculation of Lancoz vectors 

a, p, V 6 
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Fig. 5 Execution time for sequential 
Lanczos method. 


Fig. 6 Four-processor implementation of LDL T matrix 
decomposition. 



Fig. 7 Execution time for n x n matrix 
decomposition. 



Fig. 8 Speedup for n x n matrix decomposition. 
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